1 Introduction

As per Kaggle Website:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

Throughout this analysis we will perform feature engineering on predictors one by one in order to try to form the most accurate prediction for house prices. By the end of this analysis the models of ridge, lasso, and cubist will be performed in order to make the final predictions. Of the three, lasso ultimately performed the best and was chosen as the final submission.

2 Libraries and Importing Data

2.1 Libraries

library(ggplot2)
library(plyr,include.only = "revalue")
library(dplyr)
library(caret)
library(gridExtra)
library(e1071) #naive bayes
library(corrplot)
library(Metrics)
library(earth)
library(knitr)
options(scipen=999)

2.2 Loading Data Sets

train <- read.csv("train.csv",stringsAsFactors = FALSE)
test <- read.csv("test.csv",stringsAsFactors = FALSE)
# Id will not be used in this study so will be dropped from both the train and test set.
Id <- test$Id
train <- train %>% 
  select(-Id)
test <- test %>% 
  select(-Id)

2.2.1 Dimensions of Train

print(paste("The train data set contains ", nrow(train)," rows and ",ncol(train), "variables."))
## [1] "The train data set contains  1460  rows and  80 variables."

2.2.2 Dimensions of Test

print(paste("The test data set contains ", nrow(test)," rows and ",ncol(test), "variables."))
## [1] "The test data set contains  1459  rows and  79 variables."

2.2.3 Combining Data Sets for Easier Analysis

test$SalePrice <- NA
trainRows <- 1:nrow(train)
combined <- rbind(train,test)

3 Dependent Variable:

The dependent variable we are trying to predict is SalePrice: Sale prices of houses in Ames, Iowa. It exists in the train data set, but not in the test data set. It is skewed to the right with some very high trailing values. This will be addressed in the preparing for modeling section.

salePriceHisto <- ggplot(train,aes(x=SalePrice))+
  geom_histogram()+
  labs(title="SalePrice")
salePriceBox <- ggplot(train,aes(x=SalePrice))+
  geom_boxplot()+
  labs(title="SalePrice")

grid.arrange(salePriceHisto,salePriceBox,ncol=2)

4 NA Values

There exist many variables with at least one missing value in both the train and test sets.

# train dataset
for(name in names(train)){
  if(sum(is.na(train[,name]))>0){
    naSum=sum(is.na(train[,name]))
    print(paste(name," NA's:",naSum))
  }
}
## [1] "LotFrontage  NA's: 259"
## [1] "Alley  NA's: 1369"
## [1] "MasVnrType  NA's: 8"
## [1] "MasVnrArea  NA's: 8"
## [1] "BsmtQual  NA's: 37"
## [1] "BsmtCond  NA's: 37"
## [1] "BsmtExposure  NA's: 38"
## [1] "BsmtFinType1  NA's: 37"
## [1] "BsmtFinType2  NA's: 38"
## [1] "Electrical  NA's: 1"
## [1] "FireplaceQu  NA's: 690"
## [1] "GarageType  NA's: 81"
## [1] "GarageYrBlt  NA's: 81"
## [1] "GarageFinish  NA's: 81"
## [1] "GarageQual  NA's: 81"
## [1] "GarageCond  NA's: 81"
## [1] "PoolQC  NA's: 1453"
## [1] "Fence  NA's: 1179"
## [1] "MiscFeature  NA's: 1406"
# test dataset
for(name in names(test)){
  if(sum(is.na(test[,name]))>0){
    naSum = sum(is.na(test[,name]))
    print(paste(name," Na's:",naSum))
  }
}
## [1] "MSZoning  Na's: 4"
## [1] "LotFrontage  Na's: 227"
## [1] "Alley  Na's: 1352"
## [1] "Utilities  Na's: 2"
## [1] "Exterior1st  Na's: 1"
## [1] "Exterior2nd  Na's: 1"
## [1] "MasVnrType  Na's: 16"
## [1] "MasVnrArea  Na's: 15"
## [1] "BsmtQual  Na's: 44"
## [1] "BsmtCond  Na's: 45"
## [1] "BsmtExposure  Na's: 44"
## [1] "BsmtFinType1  Na's: 42"
## [1] "BsmtFinSF1  Na's: 1"
## [1] "BsmtFinType2  Na's: 42"
## [1] "BsmtFinSF2  Na's: 1"
## [1] "BsmtUnfSF  Na's: 1"
## [1] "TotalBsmtSF  Na's: 1"
## [1] "BsmtFullBath  Na's: 2"
## [1] "BsmtHalfBath  Na's: 2"
## [1] "KitchenQual  Na's: 1"
## [1] "Functional  Na's: 2"
## [1] "FireplaceQu  Na's: 730"
## [1] "GarageType  Na's: 76"
## [1] "GarageYrBlt  Na's: 78"
## [1] "GarageFinish  Na's: 78"
## [1] "GarageCars  Na's: 1"
## [1] "GarageArea  Na's: 1"
## [1] "GarageQual  Na's: 78"
## [1] "GarageCond  Na's: 78"
## [1] "PoolQC  Na's: 1456"
## [1] "Fence  Na's: 1169"
## [1] "MiscFeature  Na's: 1408"
## [1] "SaleType  Na's: 1"
## [1] "SalePrice  Na's: 1459"

4.1 Solving for Known NA Values

A lot of the NA values that exist are known NA values such as NA replacing the option of none. These are known from the Kaggle website and will be fixed now.

Alley: NA’s represent no alley access.

combined[is.na(combined$Alley),"Alley"] <- "noAccess"

Basement Quality, Basement Condition, Basement Exposure, Basement Finish type 1, Basement Finish type 2: NA’s represent no basement.

for(name in c("BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2")){
  combined[is.na(combined[,name]),name] <- "none"
}

Fireplace Quality: Na’s represent no fireplace.

combined[is.na(combined$FireplaceQu),"FireplaceQu"] <- "none"

Garage Type, Garage Finish, Garage Quality, Garage Condition: Na’s represent no garages.

for(name in c("GarageType","GarageFinish","GarageQual","GarageCond")){
  combined[is.na(combined[,name]),name] <- "none"
}

Pool Quality: NA’s represent no pool.

combined[is.na(combined$PoolQC),"PoolQC"] <- "noPool"

Fence: NA’s represent no fence.

combined[is.na(combined$Fence),"Fence"] <- "noFence"

Miscellaneous: NA’s represent no misc. feature.

combined[is.na(combined$MiscFeature),"MiscFeature"] <- "none"

4.2 Unknown NA Values

4.2.1 Remaining NA plot

Most of the remaining NA values in a variable are very small. Sale Price can be ignored as those are the known NA values from the test data set in which we are trying to solve for.

#creating na table
naDF <- data.frame(sapply(combined,function(x) sum(is.na(x))))
naDF$variable <- row.names(naDF)
names(naDF) <- c("naCount","Variable")
naDF <- naDF %>% 
  select(Variable,naCount) %>% 
  filter(naCount>0)
#creating na plot
ggplot(naDF,aes(x=reorder(Variable,desc(naCount)),y=naCount,fill=Variable))+
  geom_col(aes(),show.legend=FALSE)+
  labs(title="Remaining NA Values",
       subtitle = "Combined Data Set")+
  theme(axis.text.x=element_text(angle=45))+
  xlab("Variable")+
  ylab("NA Count")

We will now do a deep dive to understand if the remaining missing values can be filled in from the data. If not, we will use imputation.

4.2.2 Basement Variables Unknown NA’s

# counts of nas
combined %>% 
  select(BsmtFullBath,BsmtHalfBath,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF) %>% 
  sapply(.,function(x) sum(is.na(x)))
## BsmtFullBath BsmtHalfBath   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            2            2            1            1            1            1
# rows with two BsmtFullBath Nas values

kable(combined[is.na(combined$BsmtFullBath),c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF")]) 
BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
2121 NA NA NA NA NA NA
2189 NA NA 0 0 0 0

For observation 2189 we can see the total basement square feet is 0, so both the basement bathroom variables will be set to 0. In observation 2121 all basement variables are NA so the assumption is going to made that this house has no basement.

# observation 2189
combined[2189,c("BsmtFullBath","BsmtHalfBath")] <- 0

combined[2121,c("BsmtFullBath","BsmtHalfBath","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF")] <- 0

4.2.3 Garage NA Values

Of the remaining NA values in the garage variables, GarageYrBlt (Garage year built) has a significant amount of NA values with 159, while garage area and garage cars have only 1.

combined %>% select(GarageYrBlt,GarageArea,GarageCars) %>% 
  sapply(.,function(x) sum(is.na(x)))
## GarageYrBlt  GarageArea  GarageCars 
##         159           1           1
# all garage variables
garageVariables <- c("GarageArea","GarageCars","GarageCond","GarageFinish","GarageQual","GarageType","GarageYrBlt")

# row where GarageArea is na
combined[is.na(combined$GarageArea),garageVariables]
##      GarageArea GarageCars GarageCond GarageFinish GarageQual GarageType
## 2577         NA         NA       none         none       none     Detchd
##      GarageYrBlt
## 2577          NA

All garage variables of observation 2577 are pointing to that house not having a garage except garage type. Because of this, garage area and cars will be set to 0 and garage type will be changed from detached to none.

combined[2577,c("GarageArea","GarageCars")] <- 0
combined[2577,"GarageType"] <- "none"

The NA’s in GarageYrBlt (garage year built) will be addressed next. Since it is a numeric column there is no way to represent not having a garage such as “none”. The options are to divide the variables into sections and make it into a factor column with a none section, or keep it numeric and impute missing values and possibly later on make another variable signifying no garage. The latter option of keeping it numeric will be used.

print(paste("Of the combined rows,",(round(sum(combined$GarageYrBlt==combined$YearBuilt,na.rm=TRUE)/nrow(combined),2)*100),"%", "have the garage year built equalling the year the house was built."))
## [1] "Of the combined rows, 76 % have the garage year built equalling the year the house was built."

Since garage year built is the same as the house year built 76% of the time, the year the house was built will be used as the imputation value.

combined[is.na(combined$GarageYrBlt),"GarageYrBlt"] <- combined[is.na(combined$GarageYrBlt),"YearBuilt"]

Utilities will be dropped, all but one observation are the same.

table(combined$Utilities,exclude=FALSE)
## 
## AllPub NoSeWa   <NA> 
##   2916      1      2
combined <- combined %>% select(-Utilities)

4.2.4 Mode Function:

This will be used when doing mode imputation.

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

Exterior1st and exterior2nd each have one NA value. They will be replaced with their modes.

print(paste("Exterior1st NA's:",sum(is.na(combined$Exterior1st))))
## [1] "Exterior1st NA's: 1"
print(paste("Exterior2nd NA's:",sum(is.na(combined$Exterior2nd))))
## [1] "Exterior2nd NA's: 1"
table(combined$Exterior1st,exclude=FALSE)
## 
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood 
##      44       2       6      87       2     126     442       1     450     221 
##   Stone  Stucco VinylSd Wd Sdng WdShing    <NA> 
##       2      43    1025     411      56       1
table(combined$Exterior2nd,exclude=FALSE)
## 
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd   Other 
##      38       4      22      47       3     126     406      15     447       1 
## Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng    <NA> 
##     270       6      47    1014     391      81       1
combined[is.na(combined$Exterior1st),"Exterior1st"] <- getmode(combined$Exterior1st)
combined[is.na(combined$Exterior2nd),"Exterior2nd"] <- getmode(combined$Exterior2nd)

MasVnrType and MasVnrArea contain 23 and 24 NA values. Of the 23 MasVnrType NA’s MasVnrArea is also NA. There is one MasVnrArea observation that is NA without MasVnrType being NA.

sum(is.na(combined$MasVnrArea))
## [1] 23
sum(is.na(combined$MasVnrType))
## [1] 24
combined[is.na(combined$MasVnrArea),c("MasVnrArea","MasVnrType")]
##      MasVnrArea MasVnrType
## 235          NA       <NA>
## 530          NA       <NA>
## 651          NA       <NA>
## 937          NA       <NA>
## 974          NA       <NA>
## 978          NA       <NA>
## 1244         NA       <NA>
## 1279         NA       <NA>
## 1692         NA       <NA>
## 1707         NA       <NA>
## 1883         NA       <NA>
## 1993         NA       <NA>
## 2005         NA       <NA>
## 2042         NA       <NA>
## 2312         NA       <NA>
## 2326         NA       <NA>
## 2341         NA       <NA>
## 2350         NA       <NA>
## 2369         NA       <NA>
## 2593         NA       <NA>
## 2658         NA       <NA>
## 2687         NA       <NA>
## 2863         NA       <NA>
combined[is.na(combined$MasVnrType) & !is.na(combined$MasVnrArea),c("MasVnrArea","MasVnrType")]
##      MasVnrArea MasVnrType
## 2611        198       <NA>

For MasVnrType mode imputation will be used and for MasVnrArea median imputation will be used.

#mode imputation
combined[is.na(combined$MasVnrArea),"MasVnrArea"] <- median(combined$MasVnrArea,na.rm=TRUE)

combined[is.na(combined$MasVnrType),"MasVnrType"] <- median(combined$MasVnrType,na.rm=TRUE)

KitchenQual, Functional, SaleType, MSZoning, and Electrical each have only a couple NA’s so mode imputation will be used.

#sum of na's
for(name in c("KitchenQual","Functional","SaleType","MSZoning","Electrical")){
  print(name)
  print(sum(is.na(combined[,name])))
}
## [1] "KitchenQual"
## [1] 1
## [1] "Functional"
## [1] 2
## [1] "SaleType"
## [1] 1
## [1] "MSZoning"
## [1] 4
## [1] "Electrical"
## [1] 1
#replacing na's with mode
for(name in c("KitchenQual","Functional","SaleType","MSZoning","Electrical")){
  combined[is.na(combined[,name]),name] <- getmode(combined[,name])
}

Lot frontage has a considerable amount of NA values with 486. Instead of using median imputation, a simple linear regression model will be built with the variable with the highest correlation, which is lot area.

#finds all numeric variables
numericVariable <- combined %>% 
  select_if(is.numeric)

#creates dataframe of each numeric variable correlation with LotFrontage
lotFrontageCorr <- as.data.frame(sapply(numericVariable, function(x) cor(combined$LotFrontage,x,use="complete.obs")))

names(lotFrontageCorr) <- "corValues"

lotFrontageCorr %>% 
  arrange(corValues) %>% 
  tail(2)
##             corValues
## LotArea     0.4898956
## LotFrontage 1.0000000

4.2.5 Simple Linear Regression Model for Imputation

# building SLR model
slrImputation <- lm(LotFrontage~LotArea,data=combined[!is.na(combined$LotFrontage),])

#results of model
slrImputation
## 
## Call:
## lm(formula = LotFrontage ~ LotArea, data = combined[!is.na(combined$LotFrontage), 
##     ])
## 
## Coefficients:
## (Intercept)      LotArea  
##    51.99510      0.00178
#imputing lot frontage using lot area
combined[is.na(combined$LotFrontage),"LotFrontage"] <- predict(slrImputation,combined[is.na(combined$LotFrontage),])

The NA values are now all resolved.

5 ggplot Shortcut Functions

5.1 Continuous ggplot Function

### plots for numeric variables
numericggplot <- function(variable){
  histo <- ggplot(combined,aes(x=combined[,variable]))+
    geom_histogram()+
    xlab(variable)+
    theme(axis.text=element_text(size=14),
          axis.title=element_text(size=14,face="bold"))+
    labs(title=variable)
  
trainData <- combined[!is.na(combined$SalePrice),]
  
  scattero <- ggplot(trainData,aes(x=trainData[,variable],y=SalePrice))+
    geom_point()+
    geom_smooth(method = "lm", se=FALSE, color="black")+
    xlab(variable)+
    ylab("SalePrice")+
    theme(axis.text=element_text(size=14),
          axis.title=element_text(size=14,face="bold"))+
    labs(title=variable)
  
grid.arrange(histo,scattero,ncol=2)
}

5.2 Discrete ggplot Function

discreteggplot <- function(name){
  
  discretePlot <- ggplot(combined,aes(x=combined[,name]))+
    geom_bar()+xlab(name)
  
trainData <- combined[!is.na(combined$SalePrice),]  

  discreteBox <- ggplot(trainData,aes(x=as.character(trainData[,name]),y=SalePrice))+
    geom_boxplot()+xlab(name)
  
grid.arrange(discretePlot,discreteBox,ncol=2)
}

6 Exploring the Data Variable by Variable

6.1 All Numeric Variables

For numeric variables in general if it is an ordered predictor we will make it an integer and if not we will make it a double. This will make the pre-processing (scaling, centering…) much easier.

6.1.1 Lot Frontage and Area

LotFrontage: Linear feet of street connected to property. There appears to be four outliers with lot frontage above 300 including observations 250, 336, 935, and 1299.

numericggplot("LotFrontage")

kable(combined[combined$LotFrontage>300,])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
250 50 RL 334.9824 159000 Pave noAccess IR2 Low CulDSac Sev ClearCr Norm Norm 1Fam 1.5Fin 6 7 1958 2006 Gable CompShg Wd Sdng HdBoard BrkCmn 472 Gd TA CBlock Gd TA Gd Rec 697 Unf 0 747 1444 GasA Gd Y SBrkr 1444 700 0 2144 0 1 2 0 4 1 Gd 7 Typ 2 TA Attchd 1958 Fin 2 389 TA TA Y 0 98 0 0 0 0 noPool noFence Shed 500 6 2007 WD Normal 277000
336 190 RL 345.0560 164660 Grvl noAccess IR1 HLS Corner Sev Timber Norm Norm 2fmCon 1.5Fin 5 6 1965 1965 Gable CompShg Plywood Plywood None 0 TA TA CBlock TA TA Gd ALQ 1249 BLQ 147 103 1499 GasA Ex Y SBrkr 1619 167 0 1786 2 0 2 0 3 1 TA 7 Typ 2 Gd Attchd 1965 Fin 2 529 TA TA Y 670 0 0 0 0 0 noPool noFence Shed 700 8 2008 WD Normal 228950
935 20 RL 313.0000 27650 Pave noAccess IR2 HLS Inside Mod NAmes PosA Norm 1Fam 1Story 7 7 1960 2007 Flat Tar&Grv Wd Sdng Wd Sdng None 0 TA TA CBlock Gd TA Gd GLQ 425 Unf 0 160 585 GasA Ex Y SBrkr 2069 0 0 2069 1 0 2 0 4 1 Gd 9 Typ 1 Gd Attchd 1960 RFn 2 505 TA TA Y 0 0 0 0 0 0 noPool noFence none 0 11 2008 WD Normal 242000
1299 60 RL 313.0000 63887 Pave noAccess IR3 Bnk Corner Gtl Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco Stone 796 Ex TA PConc Ex TA Gd GLQ 5644 Unf 0 466 6110 GasA Ex Y SBrkr 4692 950 0 5642 2 0 2 1 3 1 Ex 12 Typ 3 Gd Attchd 2008 Fin 2 1418 TA TA Y 214 292 0 0 0 480 Gd noFence none 0 1 2008 New Partial 160000

LotArea: Lot size in square feet. Contains only four values above 100,000 including observations 250, 314, 336, and 707.

numericggplot("LotArea")

kable(combined[combined$LotArea>100000,])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
250 50 RL 334.9824 159000 Pave noAccess IR2 Low CulDSac Sev ClearCr Norm Norm 1Fam 1.5Fin 6 7 1958 2006 Gable CompShg Wd Sdng HdBoard BrkCmn 472 Gd TA CBlock Gd TA Gd Rec 697 Unf 0 747 1444 GasA Gd Y SBrkr 1444 700 0 2144 0 1 2 0 4 1 Gd 7 Typ 2 TA Attchd 1958 Fin 2 389 TA TA Y 0 98 0 0 0 0 noPool noFence Shed 500 6 2007 WD Normal 277000
314 20 RL 150.0000 215245 Pave noAccess IR3 Low Inside Sev Timber Norm Norm 1Fam 1Story 7 5 1965 1965 Hip CompShg BrkFace BrkFace None 0 TA TA CBlock Gd TA Gd ALQ 1236 Rec 820 80 2136 GasW TA Y SBrkr 2036 0 0 2036 2 0 2 0 3 1 TA 8 Typ 2 Gd Attchd 1965 RFn 2 513 TA TA Y 0 0 0 0 0 0 noPool noFence none 0 6 2009 WD Normal 375000
336 190 RL 345.0560 164660 Grvl noAccess IR1 HLS Corner Sev Timber Norm Norm 2fmCon 1.5Fin 5 6 1965 1965 Gable CompShg Plywood Plywood None 0 TA TA CBlock TA TA Gd ALQ 1249 BLQ 147 103 1499 GasA Ex Y SBrkr 1619 167 0 1786 2 0 2 0 3 1 TA 7 Typ 2 Gd Attchd 1965 Fin 2 529 TA TA Y 670 0 0 0 0 0 noPool noFence Shed 700 8 2008 WD Normal 228950
707 20 RL 256.9366 115149 Pave noAccess IR2 Low CulDSac Sev ClearCr Norm Norm 1Fam 1Story 7 5 1971 2002 Gable CompShg Plywood Plywood Stone 351 TA TA CBlock Gd TA Gd GLQ 1219 Unf 0 424 1643 GasA TA Y SBrkr 1824 0 0 1824 1 0 2 0 2 1 Gd 5 Typ 2 TA Attchd 1971 Unf 2 739 TA TA Y 380 48 0 0 0 0 noPool noFence none 0 6 2007 WD Normal 302000

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.2 Misc Value

MiscVal: Value of miscellaneous feature. Most have 0, will likely make new predictor for whether or not a house has a miscellaneous feature.

numericggplot("MiscVal")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.3 Year Built, Year Remodeled, Garage Year Built

YearBuilt: Original construction date. The newer the house, the higher the house price. However, there are a couple house built before 1900 that are very expensize. Investigating we can see that these houses had a recent remodel. Possibly an added feature later for when remodel does not equal year built.

numericggplot("YearBuilt")

kable(combined[combined$SalePrice>200000 & combined$YearBuilt<1900 & !is.na(combined$SalePrice),])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
186 75 RM 90 22950 Pave noAccess IR2 Lvl Inside Gtl OldTown Artery Norm 1Fam 2.5Fin 10 9 1892 1993 Gable WdShngl Wd Sdng Wd Sdng None 0 Gd Gd BrkTil TA TA Mn Unf 0 Unf 0 1107 1107 GasA Ex Y SBrkr 1518 1518 572 3608 0 0 2 1 4 1 Ex 12 Typ 2 TA Detchd 1993 Unf 3 840 Ex TA Y 0 260 0 0 410 0 noPool GdPrv none 0 6 2006 WD Normal 475000
305 75 RM 87 18386 Pave noAccess Reg Lvl Inside Gtl OldTown Norm Norm 1Fam 2.5Fin 7 9 1880 2002 Gable CompShg CemntBd CmentBd None 0 TA TA BrkTil TA TA No Unf 0 Unf 0 1470 1470 GasA Ex Y SBrkr 1675 1818 0 3493 0 0 3 0 3 1 Gd 10 Typ 1 Ex Attchd 2003 Unf 3 870 TA TA Y 302 0 0 0 0 0 noPool noFence none 0 5 2008 WD Normal 295000
584 75 RM 75 13500 Pave noAccess Reg Lvl Inside Gtl OldTown Artery PosA 1Fam 2.5Unf 10 9 1893 2000 Gable CompShg Wd Sdng Wd Sdng None 0 Ex Ex BrkTil TA TA No Unf 0 Unf 0 1237 1237 GasA Gd Y SBrkr 1521 1254 0 2775 0 0 3 1 3 1 Gd 9 Typ 1 Gd Detchd 1988 Unf 2 880 Gd TA Y 105 502 0 0 0 0 noPool noFence none 0 7 2008 WD Normal 325000
748 70 RM 65 11700 Pave Pave IR1 Lvl Corner Gtl OldTown Norm Norm 1Fam 2Story 7 7 1880 2003 Mansard CompShg Stucco Stucco None 0 Gd TA Stone TA Fa No Unf 0 Unf 0 1240 1240 GasW TA N SBrkr 1320 1320 0 2640 0 0 1 1 4 1 Gd 8 Typ 1 Gd Detchd 1950 Unf 4 864 TA TA N 181 0 386 0 0 0 noPool noFence none 0 5 2009 WD Normal 265979

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions).

numericggplot("YearRemodAdd")

print(paste(round(sum(combined$YearRemodAdd==combined$YearBuilt)/nrow(combined)*100,0),"% of houses did not have a remodel."))
## [1] "53 % of houses did not have a remodel."

GarageYrBlt: Year garage was built. Most of the time the garage was built when the house was built. There will be an added feature later on to depict whether or not the garage was added after the house was built. There is a clear typo with garage year built after 2200. This value will be replaced with the year that the house was built.

numericggplot("GarageYrBlt")

#percentage of time garage was built when house was built
print(paste(round(sum(combined$GarageYrBlt==combined$YearBuilt)/nrow(combined)*100,0),"% of garages built were built when the house was built."))
## [1] "81 % of garages built were built when the house was built."
#comparing garage year built with house year built
ggplot(combined,aes(x=YearBuilt,y=GarageYrBlt))+
  geom_point()

#removing typo
combined[combined$GarageYrBlt>2100,"GarageYrBlt"] <- combined[combined$GarageYrBlt>2100,"YearBuilt"]

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.4 Garage Cars and Area

GarageCars: Size of garage in car capacity.

GarageArea: Size of garage in square feet.

There exists a strong correlation between garage cars and garage area which makes sense. One of the predictors will likely be dropped later on.

discreteggplot("GarageCars")

numericggplot("GarageArea")

#
round(cor(combined$GarageCars,combined$GarageArea),2)
## [1] 0.89

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.5 MasVnrArea and MasVnrType

MasVnrArea: Masonry veneer area in square feet. There exists a lot of zeros signifying what should be none for MasVnrType: masonry veneer type. Instead, there are three observations with a labeled masonry veneer type that is not none, but have 0 for masonry veneer area. For these observations we will set the 0’s to the median masonry veneer area of those respective masonry veneer types.

numericggplot("MasVnrArea")

table(combined$MasVnrType)
## 
##  BrkCmn BrkFace    None   Stone 
##      25     879    1766     249
combined[combined$MasVnrArea==0 & combined$MasVnrType!="None",c("MasVnrArea","MasVnrType")]
##      MasVnrArea MasVnrType
## 689           0    BrkFace
## 1242          0      Stone
## 2320          0    BrkFace
# replacing 0's of BrkFace with median of MasVnrArea that have BrkFace
combined[combined$MasVnrArea==0 & combined$MasVnrType=="BrkFace","MasVnrArea"] <- median(combined[combined$MasVnrType=="BrkFace","MasVnrArea"])

# replacing 0 of Stone with median of MasVnrArea that has Stone
combined[combined$MasVnrArea==0 & combined$MasVnrType=="Stone","MasVnrArea"] <- median(combined[combined$MasVnrType=="Stone","MasVnrArea"])

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.6 Basement Area Predictors

BsmtFinSF1: Type 1 finished square feet.

BsmtFinSF2: Type 2 finished square feet.

BsmtUnfSF: Unfinished square feet of basement area.

TotalBsmtSF: Total square feet of basement area. Is the sum of finished square feet 1, finished square feet 2, and unfinished square feet. There is one outlier with a total basement square feet above 6000, observation 1299.

numericggplot("BsmtFinSF1")

numericggplot("BsmtFinSF2")

numericggplot("BsmtUnfSF")

numericggplot("TotalBsmtSF")

# Sum of all FinSF1, FinSF2, and UnfSF equal total basement sf.
sum(combined$TotalBsmtSF==combined$BsmtFinSF1+combined$BsmtFinSF2+combined$BsmtUnfSF)/nrow(combined)
## [1] 1
kable(combined[combined$TotalBsmtSF>6000,])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1299 60 RL 313 63887 Pave noAccess IR3 Bnk Corner Gtl Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco Stone 796 Ex TA PConc Ex TA Gd GLQ 5644 Unf 0 466 6110 GasA Ex Y SBrkr 4692 950 0 5642 2 0 2 1 3 1 Ex 12 Typ 3 Gd Attchd 2008 Fin 2 1418 TA TA Y 214 292 0 0 0 480 Gd noFence none 0 1 2008 New Partial 160000

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.7 Upstairs SF Area

X1stFlrSF: First Floor square feet.

X2ndFlrSF: Second floor square feet.

LowQualFinSF: Low quality finished square feet (all floors).

GrLivArea: Above grade (ground) living area square feet.

First floor square feet + second floor square feet + low quality square feet = above ground living area for each observation. Observation 1299 comes up again as an outlier, will likely be dropped later on.

numericggplot("X1stFlrSF")

numericggplot("X2ndFlrSF")

numericggplot("LowQualFinSF")

numericggplot("GrLivArea")

sum(combined$X1stFlrSF+combined$X2ndFlrSF+combined$LowQualFinSF==combined$GrLivArea)/nrow(combined)
## [1] 1
kable(combined[combined$GrLivArea>5000,])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
1299 60 RL 313 63887 Pave noAccess IR3 Bnk Corner Gtl Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco Stone 796 Ex TA PConc Ex TA Gd GLQ 5644 Unf 0 466 6110 GasA Ex Y SBrkr 4692 950 0 5642 2 0 2 1 3 1 Ex 12 Typ 3 Gd Attchd 2008 Fin 2 1418 TA TA Y 214 292 0 0 0 480 Gd noFence none 0 1 2008 New Partial 160000
2550 20 RL 128 39290 Pave noAccess IR1 Bnk Inside Gtl Edwards Norm Norm 1Fam 1Story 10 5 2008 2009 Hip CompShg CemntBd CmentBd Stone 1224 Ex TA PConc Ex TA Gd GLQ 4010 Unf 0 1085 5095 GasA Ex Y SBrkr 5095 0 0 5095 1 1 2 1 2 1 Ex 15 Typ 2 Gd Attchd 2008 Fin 3 1154 TA TA Y 546 484 0 0 0 0 noPool noFence none 17000 10 2007 New Partial NA

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.8 Number of Bathrooms

BsmtFullBath: Basement full bathrooms.

BsmtHalfBath: Basement half bathrooms.

FullBath: Full bathrooms above grade.

HalfBath: Half baths above grade.

#BsmtFullBath plot
discreteggplot("BsmtFullBath")

#BsmtHalfBath plot
discreteggplot("BsmtHalfBath")

#FullBath
discreteggplot("FullBath")

#HalfBath
discreteggplot("HalfBath")

There is no variable for the total bathrooms in a house. This will be added later on in the feature engineering section.

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.9 Bedrooms, Kitchens, Total Rooms

BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms).

KitchenAbvGr: Kitchens above grade.

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms).

The total rooms does not equal bedrooms + kitchens. A variable can be made later to address rooms that are not kitchens or bedrooms.

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(BedroomAbvGr),y=SalePrice))+geom_boxplot()+xlab("Bedrooms Above Ground")

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(KitchenAbvGr),y=SalePrice))+geom_boxplot()+xlab("Kitchens Above Ground")

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.factor(TotRmsAbvGrd),y=SalePrice))+geom_boxplot()+xlab("Total Rooms Above Ground")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.10 Fireplaces

Fireplaces: Number of fireplaces. There is only one house that contains 4 fireplaces and it occurs in the test set.

discreteggplot("Fireplaces")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.11 Porch Variables

OpenPorchSF: Open porch area in square feet.

EnclosedPorch: Enclosed porch area in square feet.

X3SsnPorch: Three season porch area in square feet.

ScreenPorch: Screen porch area in square feet.

A predictor will be made later for whether or not a house has a porch.

numericggplot("OpenPorchSF")

numericggplot("EnclosedPorch")

numericggplot("X3SsnPorch")

numericggplot("ScreenPorch")

WoodDeckSF: Wood deck area in square feet.

numericggplot("WoodDeckSF")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.1.12 Pool Area

PoolArea: Pool area in square feet. When the quality of the pool is labeled as “noPool”, the pool area should be zero. This is not the case for three observations. When comparing the size of the pool to the quality there seems to be no pattern. The values of quality for these observations will be filled in with good, which is the middle value of pool quality.

numericggplot("PoolArea")

#checking if relation between size and quality
combined[combined$PoolArea!=0,c("PoolArea","PoolQC")] %>% arrange(PoolArea)
##    PoolArea PoolQC
## 1       144     Ex
## 2       228     Ex
## 3       368 noPool
## 4       444 noPool
## 5       480     Gd
## 6       512     Ex
## 7       519     Fa
## 8       555     Ex
## 9       561 noPool
## 10      576     Gd
## 11      648     Fa
## 12      738     Gd
## 13      800     Gd
#setting the three pool quality values from nopool to Gd
combined[combined$PoolQC=="noPool" & combined$PoolArea!=0,c("PoolQC")] <- "Gd"

6.2 Converting to Double

Will make pre-processing section easier.

for(name in c("LotFrontage","LotArea","YearBuilt","YearRemodAdd","MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF","TotalBsmtSF","X1stFlrSF","X2ndFlrSF","LowQualFinSF","GrLivArea","BsmtFullBath","BsmtHalfBath","FullBath","HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces","GarageYrBlt","GarageCars","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch","X3SsnPorch","ScreenPorch","PoolArea","MiscVal")){
  combined[,name] <- as.double(combined[,name])
}

6.3 Date/Time Predictors

MoSold: Month Sold (MM).

YrSold: Year Sold (YYYY).

Most houses seem to sell in the summer months but the average house price per month sold seems pretty even. There is no clear trend upwards or downwards per month over the five years.

ggplot(combined,aes(x=as.character(MoSold)))+
  geom_bar()+
  scale_x_discrete(limits=seq(1,12))+
  xlab("Months")

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(MoSold),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,12))+
  xlab("Months")

From the previous charts we can see the end of the range of observations is in August 2010, which explains why 2010 has the least number of houses sold. There appears to be a gradual decrease in the median price of a house sold from 2006 to 2010. Both month sold and year sold will be converted to character predictors as there is no apparent ordering of sale price within them.

discreteggplot("YrSold")

combined$YrSold <- as.character(combined$YrSold)
combined$MoSold <- as.character(combined$MoSold)

6.4 Ordered Predictors

6.4.1 Overall Quality

OverallQual: Rates the overall material and finish of the house

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor

There is a clear increase in the sale of house prices as the quality of the house increases.

overallqualBar <- ggplot(combined,aes(x=OverallQual))+
  geom_bar()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Quality")

overallqualBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(OverallQual),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Quality")

grid.arrange(overallqualBar,overallqualBox,ncol=2)

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.2 Overall Condition

OverallCond: Rates the overall condition of the house.

  • 10 Very Excellent
  • 9 Excellent
  • 8 Very Good
  • 7 Good
  • 6 Above Average
  • 5 Average
  • 4 Below Average
  • 3 Fair
  • 2 Poor
  • 1 Very Poor

There is one outlier with overall condition 2 and another at overall condition 6 (observations 379 and 692).

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(OverallCond),y=SalePrice))+
  geom_boxplot()+
  scale_x_discrete(limits=seq(1,10))+
  xlab("Overall Condition")

kable(combined[combined$OverallCond==2 & combined$SalePrice>350000 & !is.na(combined$SalePrice),])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
379 20 RL 88 11394 Pave noAccess Reg Lvl Corner Gtl StoneBr Norm Norm 1Fam 1Story 9 2 2010 2010 Hip CompShg VinylSd VinylSd Stone 350 Gd TA PConc Ex TA Av GLQ 1445 Unf 0 411 1856 GasA Ex Y SBrkr 1856 0 0 1856 1 0 1 1 1 1 Ex 8 Typ 1 Ex Attchd 2010 Fin 3 834 TA TA Y 113 0 0 0 0 0 noPool noFence none 0 6 2010 New Partial 394432
kable(combined[combined$OverallCond==6 & combined$SalePrice>600000 & !is.na(combined$SalePrice),])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
692 60 RL 104 21535 Pave noAccess IR1 Lvl Corner Gtl NoRidge Norm Norm 1Fam 2Story 10 6 1994 1995 Gable WdShngl HdBoard HdBoard BrkFace 1170 Ex TA PConc Ex TA Gd GLQ 1455 Unf 0 989 2444 GasA Ex Y SBrkr 2444 1872 0 4316 0 1 3 1 4 1 Ex 10 Typ 2 Ex Attchd 1994 Fin 3 832 TA TA Y 382 50 0 0 0 0 noPool noFence none 0 1 2007 WD Normal 755000

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.3 External Quality and Condition

ExterQual: Evaluates the quality of the material on the exterior.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor

ExterCond: Evaluates the present condition of the material on the exterior.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor

External quality and external condition will both be made into ordered predictors.

noNone <- c('Po' = 0, 'Fa' = 1, 'TA' = 2, 'Gd' = 3, 'Ex' = 4)
#making into ordered
combined$ExterQual<-as.integer(revalue(combined$ExterQual, noNone))
table(train$ExterQual)
## 
##  Ex  Fa  Gd  TA 
##  52  14 488 906
combined$ExterCond<-as.integer(revalue(combined$ExterCond, noNone))

discreteggplot("ExterQual")

discreteggplot("ExterCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.4 Basement Quality and Condition

BsmtQual: Evaluates the height of the basement.

  • Ex Excellent (100+ inches)
  • Gd Good (90-99 inches)
  • TA Typical (80-89 inches)
  • Fa Fair (70-79 inches)
  • Po Poor (<70 inches
  • NA No Basement

BsmtCond: Evaluates the general condition of the basement.

  • Ex Excellent
  • Gd Good
  • TA Typical - slight dampness allowed
  • Fa Fair - dampness or some cracking or settling
  • Po Poor - Severe cracking, settling, or wetness
  • NA No Basement

Both basement quality and basement condition will be made into ordered predictors.

table(combined$BsmtCond)
## 
##   Fa   Gd none   Po   TA 
##  104  122   82    5 2606
withNone <-c('none' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

combined$BsmtQual <- as.integer(revalue(combined$BsmtQual,withNone))
combined$BsmtCond <- as.integer(revalue(combined$BsmtCond,withNone))

discreteggplot("BsmtQual")

discreteggplot("BsmtCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.5 Kitchen Quality

KitchenQual: Kitchen quality.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor

Kitchen quality made into ordered.

combined$KitchenQual <- as.integer(revalue(combined$KitchenQual,noNone))

discreteggplot("KitchenQual")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.6 Fireplace Quality

FireplaceQu: Fireplace quality.

  • Ex Excellent - Exceptional Masonry Fireplace
  • Gd Good - Masonry Fireplace in main level
  • TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
  • Fa Fair - Prefabricated Fireplace in basement
  • Po Poor - Ben Franklin Stove
  • NA No Fireplace

Converted to ordered.

combined$FireplaceQu <- as.integer(revalue(combined$FireplaceQu,withNone))
discreteggplot("FireplaceQu")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.7 Garage Quality and Condition

GarageQual: Garage quality.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor
  • NA No Garage

GarageCond: Garage condition.

  • Ex Excellent
  • Gd Good
  • TA Typical/Average
  • Fa Fair
  • Po Poor
  • NA No Garage

Converting both to ordered.

combined$GarageQual <- as.integer(revalue(combined$GarageQual,withNone))
combined$GarageCond <- as.integer(revalue(combined$GarageCond,withNone))

discreteggplot("GarageQual")

discreteggplot("GarageCond")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.8 Pool Quality

PoolQC: Pool quality.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • NA No Pool

Pool quality made into ordered. Can make feature later for whether or not a house has a pool.

poolOrdered <- c('noPool'=0,'Fa'=1,'TA'=2,'Gd'=3,'Ex'=4)
combined$PoolQC <- as.integer(revalue(combined$PoolQC,poolOrdered))

print(paste("Only ",sum(combined$PoolQC!=0)," of",nrow(combined)," houses have a pool."))
## [1] "Only  13  of 2919  houses have a pool."
discreteggplot("PoolQC")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.9 Street

Street: Type of road access to property.

  • Grvl Gravel
  • Pave Paved

Made into ordered predictor with paved being superior to gravel.

combined[combined$Street=="Grvl","Street"] <- 0
combined[combined$Street=="Pave","Street"] <- 1
combined$Street <- as.integer(combined$Street)

discreteggplot("Street")

print(paste("There are only ",sum(combined$Street==0)," houses that have a gravel steet."))
## [1] "There are only  12  houses that have a gravel steet."

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.10 LandSlope

LandSlope: Slope of property.

  • Gtl Gentle slope
  • Mod Moderate Slope
  • Sev Severe Slope

Made into ordered with Gtl>Mod>Sev.

#order labeling
combined[combined$LandSlope=="Gtl","LandSlope"] <- 0
combined[combined$LandSlope=="Mod","LandSlope"] <- 1
combined[combined$LandSlope=="Sev","LandSlope"] <- 2

discreteggplot("LandSlope")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.11 Heating quality, Central Air, Electrical

HeatingQC: Heating quality and condition.

  • Ex Excellent
  • Gd Good
  • TA Average/Typical
  • Fa Fair
  • Po Poor

CentralAir: Central air conditioning.

  • N No
  • Y Yes

Electrical: Electrical system.

  • SBrkr Standard Circuit Breakers & Romex
  • FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
  • FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
  • FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
  • Mix Mixed

Heating quality, central air, and electrical will all be changed into ordered.

noNone
## Po Fa TA Gd Ex 
##  0  1  2  3  4
combined$HeatingQC <- as.integer(revalue(combined$HeatingQC,noNone))

combined$CentralAir <- as.integer(revalue(combined$CentralAir,c('N'=0,'Y'=1)))

combined$Electrical <- as.integer(revalue(combined$Electrical,c('Mix'=0,'FuseP'=1,'FuseF'=2,'FuseA'=3,'SBrkr'=4)))


discreteggplot("HeatingQC")

discreteggplot("CentralAir")

discreteggplot("Electrical")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.12 Masonry Veneer Type

MasVnrType: Masonry veneer type.

  • BrkCmn Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • None None
  • Stone Stone

Made into ordered predictor with the sequential order of None, Brick Common, Brick Face, and then stone. Cinder block is not in the dataset so it wont be included, not sure where it would rank.

combined$MasVnrType <- as.integer(revalue(combined$MasVnrType,c('None'=0,'BrkCmn'=1,'BrkFace'=2,'Stone'=3)))
discreteggplot("MasVnrType")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.13 Basement Exposure

BsmtExposure: Refers to walkout or garden level walls.

  • Gd Good Exposure
  • Av Average Exposure (split levels or foyers typically score average or above)
  • Mn Mimimum Exposure
  • No No Exposure
  • NA No Basement

Made into ordered based on level of exposure.

table(combined$BsmtExposure)
## 
##   Av   Gd   Mn   No none 
##  418  276  239 1904   82
combined$BsmtExposure <- as.integer(revalue(combined$BsmtExposure,c('none'=0,'No'=1,'Mn'=2,'Av'=3,'Gd'=4)))
discreteggplot("BsmtExposure")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.14 Garage Finish

GarageFinish: Interior finish of the garage.

  • Fin Finished
  • RFn Rough Finished
  • Unf Unfinished
  • NA No Garage

Garage finish made into ordered.

combined$GarageFinish <- as.integer(revalue(combined$GarageFinish,c('none'=0,'Unf'=1,'RFn'=2,'Fin'=3)))

discreteggplot("GarageFinish")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.15 Paved Drive

PavedDrive: Paved driveway.

  • Y Paved
  • P Partial Pavement
  • N Dirt/Gravel
combined$PavedDrive <- as.integer(revalue(combined$PavedDrive,c('N'=0,'P'=1,'Y'=2)))

discreteggplot("PavedDrive")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.4.16 Fence

Fence: Fence quality.

  • GdPrv Good Privacy
  • MnPrv Minimum Privacy
  • GdWo Good Wood
  • MnWw Minimum Wood/Wire
  • NA No Fence

It appears having no fence translates to a higher sales price than having a fence. Converting this predictor to ordered with having no fence being the most valuable.

combined$Fence <- as.integer(revalue(combined$Fence,c('MnWw'=0,'GdWo'=1,'MnPrv'=2,'GdPrv'=3,'noFence'=4)))

discreteggplot("Fence")

6.5 Converting to Integers

Will make pre-processing section easier.

for(name in c("Street","OverallQual","OverallCond","ExterQual","ExterCond","BsmtQual","BsmtCond","BsmtExposure","HeatingQC","CentralAir","KitchenQual","FireplaceQu","GarageFinish","GarageQual","GarageCond","PavedDrive","PoolQC","Fence")){
  combined[,name] <- as.integer(combined[,name])
}

6.6 Character Predictors

6.6.1 MSSubClass

MSSubClass: Identifies the type of dwelling involved in the sale.

  • 20 1-STORY 1946 & NEWER ALL STYLES
  • 30 1-STORY 1945 & OLDER
  • 40 1-STORY W/FINISHED ATTIC ALL AGES
  • 45 1-1/2 STORY - UNFINISHED ALL AGES
  • 50 1-1/2 STORY FINISHED ALL AGES
  • etc (11 more labels)

The type of dwelling is currently a numeric predictor, it should be a factor. For now it will be made into a string and factor later on. Rare labels will also be addressed later on.

str(combined$MSSubClass)
##  int [1:2919] 60 20 60 70 60 50 20 60 50 190 ...
combined$MSSubClass <- as.character(combined$MSSubClass)

discreteggplot("MSSubClass")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.2 MSZoning

MSZoning: Identifies the general zoning classification of the sale.

  • A Agriculture
  • C Commercial
  • FV Floating Village Residential
  • I Industrial
  • RH Residential High Density
  • RL Residential Low Density
  • RP Residential Low Density Park
  • RM Residential Medium Density
discreteggplot("MSZoning")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.3 Alley

Alley: Type of alley access to property.

  • Grvl Gravel
  • Pave Paved
  • NA No alley access
discreteggplot("Alley")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.4 Lotshape

LotShape: General shape of property.

  • Reg Regular
  • IR1 Slightly irregular
  • IR2 Moderately Irregular
  • IR3 Irregular

Could be an ordered predictor, box-plot doesn’t support this claim though so leaving as character.

discreteggplot("LotShape")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.5 Land Contour

LandContour: Flatness of the property.

  • Lvl Near Flat/Level
  • Bnk Banked - Quick and significant rise from street grade to building
  • HLS Hillside - Significant slope from side to side
  • Low Depression
discreteggplot("LandContour")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.6 Lot Config.

LotConfig: Lot configuration.

  • Inside Inside lot
  • Corner Corner lot
  • CulDSac Cul-de-sac
  • FR2 Frontage on 2 sides of property
  • FR3 Frontage on 3 sides of property
discreteggplot("LotConfig")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.7 Neighborhood

Neighborhood: Physical locations within Ames city limits.

  • Blmngtn Bloomington Heights
  • Blueste Bluestem
  • BrDale Briardale
  • BrkSide Brookside
  • etc (21 other labels)

There exists 25 possible labels for neighborhood, some form of grouping will have to be used later on.

discreteggplot("Neighborhood")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.8 Condition1 and Condition2

Condition1: Proximity to various conditions.

  • Artery Adjacent to arterial street
  • Feedr Adjacent to feeder street
  • Norm Normal
  • RRNn Within 200’ of North-South Railroad
  • RRAn Adjacent to North-South Railroad
  • PosN Near positive off-site feature–park, greenbelt, etc.
  • PosA Adjacent to postive off-site feature
  • RRNe Within 200’ of East-West Railroad
  • RRAe Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present).

  • Artery Adjacent to arterial street
  • Feedr Adjacent to feeder street
  • Norm Normal
  • RRNn Within 200’ of North-South Railroad
  • RRAn Adjacent to North-South Railroad
  • PosN Near positive off-site feature–park, greenbelt, etc.
  • PosA Adjacent to postive off-site feature
  • RRNe Within 200’ of East-West Railroad
  • RRAe Adjacent to East-West Railroad

Most of the houses have condition1 equal to condition2, one of the columns will likely be dropped.

print(paste(round(sum((combined$Condition1==combined$Condition2)/nrow(combined)*100),2),"% of houses have condition1 = condition2"))
## [1] "86.47 % of houses have condition1 = condition2"
discreteggplot("Condition1")

discreteggplot("Condition2")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.9 Building Type and House Style

BldgType: Type of dwelling.

  • 1Fam Single-family Detached
  • 2FmCon Two-family Conversion; originally built as one-family dwelling
  • Duplx Duplex
  • TwnhsE Townhouse End Unit
  • TwnhsI Townhouse Inside Unit

HouseStyle: Style of dwelling.

  • 1Story One story
  • 1.5Fin One and one-half story: 2nd level finished
  • 1.5Unf One and one-half story: 2nd level unfinished
  • 2Story Two story
  • 2.5Fin Two and one-half story: 2nd level finished
  • 2.5Unf Two and one-half story: 2nd level unfinished
  • SFoyer Split Foyer
  • SLvl Split Level

House Style has some rare labels that may require grouping.

discreteggplot("BldgType")

discreteggplot("HouseStyle")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.10 Roof Style and Material

RoofStyle: Type of roof.

  • Flat Flat
  • Gable Gable
  • Gambrel Gabrel (Barn)
  • Hip Hip
  • Mansard Mansard
  • Shed Shed

RoofMatl: Roof material.

  • ClyTile Clay or Tile
  • CompShg Standard (Composite) Shingle
  • Membran Membrane
  • Metal Metal
  • Roll Roll
  • Tar&Grv Gravel & Tar
  • WdShake Wood Shakes
  • WdShngl Wood Shingles

Roof material has four levels that only occur once. These levels will either need to be dropped or grouped.

discreteggplot("RoofStyle")

discreteggplot("RoofMatl")

table(combined$RoofMatl)
## 
## ClyTile CompShg Membran   Metal    Roll Tar&Grv WdShake WdShngl 
##       1    2876       1       1       1      23       9       7

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.11 Exterior1st and 2nd

Exterior1st: Exterior covering on house.

  • AsbShng Asbestos Shingles
  • AsphShn Asphalt Shingles
  • BrkComm Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • etc (12 other levels)

Exterior2nd: Exterior covering on house (if more than one material).

  • AsbShng Asbestos Shingles
  • AsphShn Asphalt Shingles
  • BrkComm Brick Common
  • BrkFace Brick Face
  • CBlock Cinder Block
  • etc (12 other levels)

Most of the time exterior1st is the same as exterior2nd, one of the predictors will likely be dropped.

print(paste(round((sum(combined$Exterior1st==combined$Exterior2nd)/nrow(combined)*100),0),"% of the time exterior1st = exterior2nd."))
## [1] "85 % of the time exterior1st = exterior2nd."
exterior1stBar <- ggplot(combined,aes(x=Exterior1st))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45))
  

exterior1stBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=Exterior1st,y=SalePrice))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(exterior1stBar,exterior1stBox, ncol=2)

exterior2ndBar <- ggplot(combined,aes(x=Exterior2nd))+
  geom_bar()+
  theme(axis.text.x = element_text(angle = 45))
  

exterior2ndBox <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=Exterior2nd,y=SalePrice))+
  geom_boxplot()+
  theme(axis.text.x = element_text(angle = 45))

grid.arrange(exterior2ndBar,exterior2ndBox, ncol=2)

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.12 Foundation

Foundation: Type of foundation.

  • BrkTil Brick & Tile
  • CBlock Cinder Block
  • PConc Poured Contrete
  • Slab Slab
  • Stone Stone
  • Wood Wood
discreteggplot("Foundation")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.13 BsmtFinType 1 and 2

BsmtFinType1: Rating of basement finished area.

  • GLQ Good Living Quarters
  • ALQ Average Living Quarters
  • BLQ Below Average Living Quarters
  • Rec Average Rec Room
  • LwQ Low Quality
  • Unf Unfinshed
  • NA No Basement

BsmtFinType2: Rating of basement finished area (if multiple types).

  • GLQ Good Living Quarters
  • ALQ Average Living Quarters
  • BLQ Below Average Living Quarters
  • Rec Average Rec Room
  • LwQ Low Quality
  • Unf Unfinshed
  • NA No Basement

When plotting the sale price didn’t increase as would be expected when moving from no basement to good living quarters. Because of this, these two predictors will remain character instead of being made into ordered.

print(paste(round((sum(combined$BsmtFinType1==combined$BsmtFinType2)/nrow(combined))*100,0),"% of the time BsmtFinType1=BsmtFinType2"))
## [1] "32 % of the time BsmtFinType1=BsmtFinType2"
discreteggplot("BsmtFinType1")

discreteggplot("BsmtFinType2")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.14 Heating

Heating: Type of heating.

  • Floor Floor Furnace
  • GasA Gas forced warm air furnace
  • GasW Gas hot water or steam heat
  • Grav Gravity furnace
  • OthW Hot water or steam heat other than gas
  • Wall Wall furnace

Nearly all houses have gas forced heating.

discreteggplot("Heating")

table(combined$Heating)
## 
## Floor  GasA  GasW  Grav  OthW  Wall 
##     1  2874    27     9     2     6

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.15 Functional

Functional: Home functionality (Assume typical unless deductions are warranted).

  • Typ Typical Functionality
  • Min1 Minor Deductions 1
  • Min2 Minor Deductions 2
  • Mod Moderate Deductions
  • Maj1 Major Deductions 1
  • Maj2 Major Deductions 2
  • Sev Severely Damaged
  • Sal Salvage only
discreteggplot("Functional")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.16 Garage Type

GarageType: Garage location.

  • 2Types More than one type of garage
  • Attchd Attached to home
  • Basment Basement Garage
  • BuiltIn Built-In (Garage part of house - typically has room above garage)
  • CarPort Car Port
  • Detchd Detached from home
  • NA No Garage
discreteggplot("GarageType")

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.17 Misc Feature

MiscFeature: Miscellaneous feature not covered in other categories.

  • Elev Elevator
  • Gar2 2nd Garage (if not described in garage section)
  • Othr Other
  • Shed Shed (over 100 SF)
  • TenC Tennis Court
  • NA None
discreteggplot("MiscFeature")

print(paste(round(sum(combined$MiscFeature=="none")/nrow(combined)*100,0),"% of houses don't have a miscellaneous feature."))
## [1] "96 % of houses don't have a miscellaneous feature."

\(\textbf{Pick the next tab in order to see other variables.}\)

6.6.18 Sale Type and Condition

SaleType: Type of sale.

  • WD Warranty Deed - Conventional
  • CWD Warranty Deed - Cash
  • VWD Warranty Deed - VA Loan
  • New Home just constructed and sold
  • COD Court Officer Deed/Estate
  • Con Contract 15% Down payment regular terms
  • ConLw Contract Low Down payment and low interest
  • ConLI Contract Low Interest
  • ConLD Contract Low Down
  • Oth Other

SaleCondition: Condition of sale.

  • Normal Normal Sale
  • Abnorml Abnormal Sale - trade, foreclosure, short sale
  • AdjLand Adjoining Land Purchase
  • Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
  • Family Sale between family members
  • Partial Home was not completed when last assessed (associated with New Homes)

Both are dominated by one label and have a number of rare labels.

discreteggplot("SaleType")

discreteggplot("SaleCondition")

7 Base Score

Now that the data is complete and visualized, a random forest model will be run to find baseline results. Throughout the rest of the analysis we will apply techniques to try to beat this score.

7.1 Train Control for RF

custom_summary = function(data, lev = NULL, model = NULL) {
library(Metrics)
out = rmsle(data[, "obs"], data[, "pred"])
names(out) = c("rmsle")
out
}
ctrl <- trainControl(method="cv",
                     number=10,
                     summaryFunction = custom_summary,
                     allowParallel = TRUE)

7.2 Random Forest Baseline Model

set.seed(123)
rfBase <- train(SalePrice~.,data=combined[!is.na(combined$SalePrice),],
                trControl=ctrl,
                metric="rmsle",
                maximize=FALSE,
                tuneGrid=expand.grid(mtry=seq(20,40,by=5)))

saveRDS(rfBase,"rfBase.rds")
rfBase <- readRDS("rfBase.rds")
min(rfBase$results$rmsle)
## [1] 0.1403416

8 Variable Importance and Correlation

8.1 Variable Importance

From the baseline random forest model we are also able to gather the variable importance in predicting sales price. The top 10 listed important variables are overall quality, above ground living area, garage cars, external quality, total basement square feet, kitchen quality, 1st floor square feet, garage area, year built, and basement quality.

varImp(rfBase)
## rf variable importance
## 
##   only 20 most important variables shown (out of 240)
## 
##              Overall
## OverallQual  100.000
## GrLivArea     63.966
## GarageCars    48.072
## ExterQual     41.407
## TotalBsmtSF   33.555
## KitchenQual   30.724
## X1stFlrSF     29.987
## GarageArea    29.869
## YearBuilt     28.402
## BsmtQual      26.117
## X2ndFlrSF     17.425
## BsmtFinSF1    16.837
## FullBath      13.471
## LotArea       11.408
## TotRmsAbvGrd  10.881
## FireplaceQu   10.153
## GarageYrBlt    9.458
## YearRemodAdd   8.702
## GarageFinish   8.260
## MasVnrArea     7.889

8.2 Correlation

The top 10 correlated predictors with sales price is overall quality, above ground living area, external quality, kitchen quality, garage cars, garage area, total basement square feet, first floor square feet, basement quality, and full bathrooms. 9 of the top 10 ranked predictors in the random forest model are also in the top 10 most correlated predictors.

numericVariables <- combined %>% 
  select_if(is.numeric)

SalePriceCorrelations <- as.data.frame(sapply(numericVariables,function(x) cor(x,numericVariables$SalePrice,use="complete.obs")))

names(SalePriceCorrelations) <- c("correlation")

SalePriceCorrelations %>% 
  filter(abs(correlation)>.5) %>% 
  arrange(desc(correlation))
##              correlation
## SalePrice      1.0000000
## OverallQual    0.7909816
## GrLivArea      0.7086245
## ExterQual      0.6826392
## KitchenQual    0.6595997
## GarageCars     0.6404092
## GarageArea     0.6234314
## TotalBsmtSF    0.6135806
## X1stFlrSF      0.6058522
## BsmtQual       0.5852072
## FullBath       0.5606638
## GarageFinish   0.5492468
## TotRmsAbvGrd   0.5337232
## YearBuilt      0.5228973
## FireplaceQu    0.5204376
## GarageYrBlt    0.5080433
## YearRemodAdd   0.5071010

Highly correlated predictors will be dropped in preparing for modeling section.

corrplot(cor(numericVariables,use="complete.obs"))

9 Feature Engineering

Now that our data is complete and we have visualized our data, we will begin creating our own variables in the hopes of achieving better model accuracy.

9.1 Created Features

9.1.1 hasRemodel

Has Remodel: A feature will be created for whether or not a house has a remodel. It will take place in any house where the remodel year does not equal the year the house was built. The houses with a remodel on average sell for slightly less, this might be because houses that had a remodel are typically over 20 years older on average.

combined <- combined %>% mutate(hasRemodel=ifelse(YearBuilt!=YearRemodAdd,1,0))
#made into character for now, factor later
combined$hasRemodel <- as.character(combined$hasRemodel)

# nearly half the houses have had a remodel
table(combined$hasRemodel)
## 
##    0    1 
## 1560 1359
ggplot(combined[!is.na(combined$SalePrice),],aes(x=hasRemodel,y=SalePrice))+
  geom_boxplot()+xlab("Has a Remodel")

#average year built with and without remodel
combined %>% group_by(hasRemodel) %>% 
  summarize(avgYearBuilt=mean(YearBuilt))
## # A tibble: 2 × 2
##   hasRemodel avgYearBuilt
##   <chr>             <dbl>
## 1 0                 1983.
## 2 1                 1958.
combined$hasRemodel <- as.integer(combined$hasRemodel)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.2 Total Bathrooms

Total Bathrooms: There are four predictors for number of bathrooms, but no predictor for total bathrooms. Total Bathrooms will be created here.

#creating total bathroom predictor
combined <- combined %>% mutate(totalBathrooms=FullBath+HalfBath+BsmtFullBath+BsmtHalfBath)

ggplot(combined[!is.na(combined$SalePrice),],aes(x=as.character(totalBathrooms),y=SalePrice))+
  geom_boxplot()+ xlab("Total Bathrooms")

combined$totalBathrooms <- as.double(combined$totalBathrooms)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.3 Other Rooms

Other Rooms: There was a predictor for kitchens and a predictor for bedrooms, but there was no predictor for rooms that were not kitchens or bedrooms. We will address this now.

#not bedrooms or kitchens = totalRooms-bedrooms-kitchens
combined <- combined %>% mutate(otherRooms=TotRmsAbvGrd-KitchenAbvGr-BedroomAbvGr)

discreteggplot("otherRooms")

combined$otherRooms <- as.double(combined$otherRooms)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.4 Garage Built After House

Garage was Built After House: A binary predictor will be added if a garage was built after the house was built. It appears the sale price is lower if a garage was added after the house was built even though newly added garages on average are bigger. This can again be explained because the houses with added garages are on average much older.

# creating predictor
combined$newGarage <- ifelse(combined$GarageYrBlt!=combined$YearBuilt,1,0)
# most houses that have a garage were built with house
discreteggplot("newGarage")

# mean size and year built
combined %>% group_by(newGarage) %>% summarize(mean(GarageArea),
                                            mean(YearBuilt))
## # A tibble: 2 × 3
##   newGarage `mean(GarageArea)` `mean(YearBuilt)`
##       <dbl>              <dbl>             <dbl>
## 1         0               465.             1976.
## 2         1               507.             1949.
combined$newGarage <- as.integer(combined$newGarage)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.5 hasPorch

Has a Porch: A predictor can be added for whether or not a house has a porch.

#if these 4 predictors = 0, then no porch
combined$hasPorch <- ifelse(combined$OpenPorchSF==0 & combined$X3SsnPorch== 0 & combined$ScreenPorch== 0 & combined$EnclosedPorch==0,0,1)

discreteggplot("hasPorch")

combined$hasPorch <- as.integer(combined$hasPorch)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.6 hasPool

Has Pool: Creating a variable to show if a house has a pool. Appears sale price of a house with a pool is considerably higher.

# if pool area is 0, then no pool
combined$hasPool <- ifelse(combined$PoolArea==0,0,1)

discreteggplot("hasPool")

combined$hasPool <- as.integer(combined$hasPool)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.7 hasMiscFeature

Has a Miscellaneous Feature: There is predictor describing which miscellaneous feature, but not a feature for whether or not a house has a miscellaneous feature in general.

table(combined$MiscFeature)
## 
## Gar2 none Othr Shed TenC 
##    5 2814    4   95    1
combined$hasMisc <- ifelse(combined$MiscFeature=="none",0,1)

combined$hasMisc <- as.integer(combined$hasMisc)

\(\textbf{Pick the next tab in order to see other variables.}\)

9.1.8 Age of House Split

Extra Age of House Predictor: Since it is unlikely that the depreciation of a house from 2005 to 2000 would be the same as a house from 1940 to 1935, a variable is created to try to address this. Using a mars model with one pivot it appears an added year in the age of a house has less meaning for houses built before 1971.

# Building mars model to find hinge point
yearBuiltBinned <- earth(SalePrice~YearBuilt,combined[!is.na(combined$SalePrice),],nprune = 2)

#building data frame of predicted values from the mars model
predictedValues <- data.frame(YearBuilt=combined[!is.na(combined$SalePrice),"YearBuilt"],predict(yearBuiltBinned,combined[!is.na(combined$SalePrice),]))
head(predictedValues)
##   YearBuilt SalePrice
## 1      2003  241708.1
## 2      1976  156793.0
## 3      2001  235418.1
## 4      1915  141067.9
## 5      2000  232273.1
## 6      1993  210258.1
#plotting what it looks like with the hinge function
predictedValues %>% 
  ggplot(.,aes(YearBuilt,SalePrice))+
  geom_smooth(color="red",size=3)+
  geom_point(data=combined[!is.na(combined$SalePrice),],aes(x=YearBuilt,y=SalePrice))+
  geom_point()+
  annotate("segment",x=1971,xend=1971,y=0,yend=350000,color="red")+
  labs(title="Hinge at 1971")

#adding the binary variable
combined$OldHouse <- ifelse(combined$YearBuilt<=1971,1,0)

combined$OldHouse <- as.integer(combined$OldHouse)

9.2 Outlier

Observation 1299 will be dropped as an outlier.

ggplot(combined[!is.na(combined$SalePrice),],aes(x=TotalBsmtSF,y=SalePrice))+
  geom_point()

kable(combined[1299,])
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice hasRemodel totalBathrooms otherRooms newGarage hasPorch hasPool hasMisc OldHouse
1299 60 RL 313 63887 1 noAccess IR3 Bnk Corner 0 Edwards Feedr Norm 1Fam 2Story 10 5 2008 2008 Hip ClyTile Stucco Stucco 3 796 4 2 PConc 5 3 4 GLQ 5644 Unf 0 466 6110 GasA 4 1 4 4692 950 0 5642 2 0 2 1 3 1 4 12 Typ 3 4 Attchd 2008 3 2 1418 3 3 2 214 292 0 0 0 480 3 4 none 0 1 2008 New Partial 160000 0 5 8 0 1 1 0 0
# removing obs 1299
combined <- combined[-1299,]

10 Preparing for Modeling

10.1 Removing Highly Correlated Predictors

Predictors that have a correlation above .8 with another predictor will be dropped. The predictor that will be dropped is the one with lower correlation with the dependent variable sales price.

# creating data frame of correlations
allCorrelations <- as.data.frame(as.table(cor(numericVariables[,-52]))) #52 is sales price

#selecting only those with correlation above .8
highCorrelations <- allCorrelations %>% 
  filter(abs(Freq)>.8 & Freq!=1) %>% 
  arrange(desc(Freq))
# removing doubles such as garagecondition:garagequality and garagequality:garagecondition
noRepeats <- highCorrelations[seq(1,nrow(highCorrelations),by=2),]
noRepeats
##            Var1        Var2      Freq
## 1    GarageCond  GarageQual 0.9466563
## 3    GarageArea  GarageCars 0.8898902
## 5   FireplaceQu  Fireplaces 0.8617354
## 7   GarageYrBlt   YearBuilt 0.8607302
## 9        PoolQC    PoolArea 0.8580114
## 11 TotRmsAbvGrd   GrLivArea 0.8083544
## 13    X1stFlrSF TotalBsmtSF 0.8013759
#printing the name that has the lesser of the two correlations with sales price
for(i in 1:nrow(noRepeats)){
  #columns one correlation with sales price
  cor1=abs(cor(combined[,as.character(noRepeats[i,1])],combined$SalePrice,use="complete.obs"))#column two correlation with sales price
  cor2=abs(cor(combined[,as.character(noRepeats[i,2])],combined$SalePrice,use="complete.obs"))#printing lesser name
  ifelse(cor1<cor2,print(paste(cor1,noRepeats[i,1])),print(paste(cor2,noRepeats[i,2])))
}
## [1] "0.263251371200459 GarageCond"
## [1] "0.628469439926225 GarageArea"
## [1] "0.469822867956979 Fireplaces"
## [1] "0.50852158709765 GarageYrBlt"
## [1] "0.0994876681220653 PoolArea"
## [1] "0.536441973481322 TotRmsAbvGrd"
## [1] "0.625662261262474 X1stFlrSF"
#dropping those names from dataset
combined <- combined %>% select(-GarageCond,-GarageArea,-Fireplaces,-GarageYrBlt,-PoolArea,-TotRmsAbvGrd,-X1stFlrSF)

10.2 Dummy Variables

stringVariables <- combined %>% select_if(is.character)

makeDummy <- dummyVars(~.,data=stringVariables)

dummiedVars <- predict(makeDummy,stringVariables)

dummiedVars <- as.data.frame(dummiedVars)

10.3 Removing Labels Under 1% of Data

Removing labels that are in under 1% of the train data. 69 dummy variables were dropped.

onePercent <- nrow(combined[!is.na(combined$SalePrice),])/100 #14.59

print(paste("Before there were",ncol(dummiedVars),"dummied variables."))
## [1] "Before there were 216 dummied variables."
# removing all predictors that are under 1%
for(i in ncol(dummiedVars):1){
  #1459 is the number of train rows
  if(sum(dummiedVars[1:1459,i])<onePercent){
    dummiedVars <- dummiedVars[,-i]
  }
}

print(paste("Now there are",ncol(dummiedVars),"dummied variables."))
## [1] "Now there are 147 dummied variables."

10.4 Standardizing Predictors

Box-Cox and standardization will be applied to all numeric variables that are not ordered. Since we set all ordered variables to integers, we can call all non-ordered integers very easily. Caret package will be used to achieve this.

doubleVariables <- combined %>% select_if(is.double)
kable(head(doubleVariables,3))
LotFrontage LotArea YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageCars WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch MiscVal totalBathrooms otherRooms
65 8450 2003 2003 196 706 0 150 856 854 0 1710 1 0 2 1 3 1 2 0 61 0 0 0 0 4 4
80 9600 1976 1976 0 978 0 284 1262 0 0 1262 0 1 2 0 3 1 2 298 0 0 0 0 0 3 2
68 11250 2001 2002 162 486 0 434 920 866 0 1786 1 0 2 1 3 1 2 0 42 0 0 0 0 4 2
makeCenterScale <- preProcess(doubleVariables,method=c("BoxCox","center","scale"))

centerAndScaled <- predict(makeCenterScale,doubleVariables)

kable(head(centerAndScaled,3))
LotFrontage LotArea YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr GarageCars WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch MiscVal totalBathrooms otherRooms
-0.1511310 -0.1025818 1.0521221 0.8982123 0.5303965 0.5983871 -0.2930296 -0.9339259 -0.4492400 1.2078793 -0.1011972 0.5676963 1.0896616 -0.2497213 0.7814475 1.2329198 0.1699279 -0.2077 0.3070447 -0.7402955 0.2016303 -0.3596087 -0.1033307 -0.2859394 -0.0895921 1.5274357 1.2601127
0.5160110 0.1482241 0.1483483 -0.4005552 -0.5681235 1.2090986 -0.2930296 -0.6291468 0.4924575 -0.7845412 -0.1011972 -0.3698165 -0.8188814 3.8211539 0.7814475 -0.7558364 0.1699279 -0.2077 0.3070447 1.6149019 -0.7029414 -0.3596087 -0.1033307 -0.2859394 -0.0895921 0.6547072 -0.4718745
-0.0119914 0.4599805 0.9847553 0.8497954 0.3398369 0.1044293 -0.2930296 -0.2879762 -0.3007951 1.2358759 -0.1011972 0.7018913 1.0896616 -0.2497213 0.7814475 1.2329198 0.1699279 -0.2077 0.3070447 -0.7402955 -0.0801216 -0.3596087 -0.1033307 -0.2859394 -0.0895921 1.5274357 -0.4718745

10.4.1 Log Transformation to Sale Price

When applying the log transformation to the sale price we can see it follows a near normal distribution. Also, since the final Kaggle submission is in RMSLE we can now use carets base numerical summary which is RMSE.

woLog <- ggplot(combined[!is.na(combined$SalePrice),],aes(x=SalePrice))+
  geom_histogram()

wLog <- combined[!is.na(combined$SalePrice),] %>% 
  mutate(logSalePrice=log(SalePrice)) %>% 
ggplot(.,aes(x=logSalePrice))+
         geom_histogram()+
  xlab("log(SalePrice)")

grid.arrange(woLog,wLog,ncol=2)

# making saleprice the log(saleprice)

SalePrice <- log(combined$SalePrice)

combined <- combined %>% select(-SalePrice)

Here the data will be put back together. This means combining the ordered predictors, center and scaled predictors, and dummied predictors back to one data set.

#selecting the remaining terms (ordered predictors)
orderPredictors <- combined %>% select_if(is.integer)

combined <- cbind(SalePrice,orderPredictors,centerAndScaled,dummiedVars)

dim(combined)
## [1] 2918  200

10.5 Non-Existent Labels

Having labels that only exist in train but not in test can lead to over fitting. Having labels that only occur in test but not in train can lead to errors. Therefor, labels that exist only in train but not in test and vice versa will be removed. Found that this problem does not occur after removing values that existed in less than 1% of the train data.

#ordering so saleprice is first
combined <- combined %>% select(SalePrice,everything())

#creating train and test sets
train <- combined[!is.na(combined$SalePrice),]

test <- combined[is.na(combined$SalePrice),]

#if all 0's for train data portion removing that predictor from combined
counter <- 0
for(i in ncol(train):2){
  if(sum(train[,i])==0){
    counter=counter+1
    print(names(train)[i])
    combined <- combined[,-i]
  }
  if(i==2){print(paste("For the train data:", counter," variables were dropped"))}
}
## [1] "For the train data: 0  variables were dropped"
 #if all 0's for the test data portion removing that predictor from combined
counter <- 0
for(i in ncol(test):2){
  if(sum(test[,i])==0){
    print(names(test)[i])
    combined <- combined[,-i]
    counter=counter+1
  }
  if(i==2){print(paste("For the test data:", counter," variables were dropped"))}
}
## [1] "For the test data: 0  variables were dropped"

The data is now pre-processed and in the format I wanted. Now, a train-test split can be performed before modeling.

train <- combined[!is.na(combined$SalePrice),]
test <- combined[is.na(combined$SalePrice),]

11 Modeling

Throughout this modeling phase we will perform ridge, lasso, and cubist models.

11.1 Train Control

#10 fold cross validation
ctrl <- trainControl(method="cv",
                     number=10,
                     verbose=FALSE)

11.2 Lasso

The lasso model is good at conquering multicollinearity which is likely a problem in this data set. It achieved a train RMSLE of .1179.

set.seed(432)
lassoMod <- train(SalePrice~.,data=train,
                  method="glmnet",
                  trControl=ctrl,
                  tuneGrid=expand.grid(alpha=1,lambda=seq(0.001,.1,by=.001)))

saveRDS(lassoMod,"lassoMod.rds")
lassoMod <- readRDS("lassoMod.rds")
min(lassoMod$results$RMSE)
## [1] 0.1179436

11.3 Ridge

The ridge model did not perform as well as the lasso model.

set.seed(223)
ridgeMod <- train(SalePrice~.,data=train,
                  method="glmnet",
                  trControl=ctrl,
                  tuneGrid=expand.grid(alpha=0,lambda=seq(.001,.1,by=.001)))

saveRDS(ridgeMod,"ridgeMod.rds")
ridgeMod <- readRDS("ridgeMod.rds")
min(ridgeMod$results$RMSE)
## [1] 0.1211724

11.4 Cubist

The cubist model appears to perform ever so slightly better than the lasso model, but is much more complex.

set.seed(1234)
cubistMod <- train(SalePrice~.,data=train,
                    method="cubist",
                    trControl=ctrl,
                    tuneGrid=expand.grid(committees=seq(70,100,by=5),neighbors=c(7,8,9)))
saveRDS(cubistMod,"cubistMod.rds")
cubistMod <- readRDS("cubistMod.rds")
min(cubistMod$results$RMSE)
## [1] 0.1174491

11.5 Comparing Models

modelResults <- resamples(list(lasso=lassoMod,
                               ridge=ridgeMod,
                            cubist=cubistMod))
bwplot(modelResults,metric="RMSE")

12 Final Submission

I submitted both the cubist and lasso model predictions. The lasso model slightly outperformed the cubist model and achieved a RMSLE of .12273.

#exp to transform back to original units
outputDF <- data.frame(ID=Id,SalePrice=exp(predict(lassoMod,test)))

write.csv(x = outputDF,file = "submission717lasso.csv",row.names=FALSE)